Model Diagnostics

We will be using the taxlot_sample data for this lab. Download taxlot_sample.csv

It’s generally a good idea to start your modeling work, including diagnostics, with descriptive statistics and visualization:

require(tidyverse)
taxlot_sample <- read_csv("data/taxlot_sample.csv")

set.seed(100)
taxlot_sfr <- taxlot_sample %>% 
  filter(sfr == 1)
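A minimal sketch of the descriptive-statistics step, using base R's `summary()` and `sapply()`. The data frame `d` below is a simulated stand-in (the values are made up for illustration); with the real data, substitute `taxlot_sfr` and its columns:

```r
# Simulated stand-in for taxlot_sfr; values are made up for illustration.
set.seed(1)
d <- data.frame(
  TOTALVAL = exp(rnorm(100, mean = 12, sd = 0.5)),
  BLDGSQFT = rnorm(100, mean = 2000, sd = 500)
)
summary(d)      # min, quartiles, mean, max for each column
sapply(d, sd)   # standard deviation of each column
```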

require(ggplot2)

# histograms for outcome and predictor variables
ggplot(data=taxlot_sfr, aes(x=TOTALVAL)) + 
  geom_histogram()

ggplot(data=taxlot_sfr, aes(x=BLDGSQFT)) + 
  geom_histogram()

# boxplots for outcome and predictor variables
ggplot(data=taxlot_sfr, aes(x="TOTALVAL", y=TOTALVAL)) + 
  geom_boxplot()

ggplot(data=taxlot_sfr, aes(x="BLDGSQFT", y=BLDGSQFT)) + 
  geom_boxplot()

# scatter plot between outcome and predictor variable
ggplot(data=taxlot_sfr, aes(x=BLDGSQFT, y=TOTALVAL)) + 
  geom_point() + geom_smooth(method="lm", se=FALSE)

# Which observations do you suspect are outliers/influential observations?

In many cases, it is useful to find out which observations they are and what the values of the other variables are for those observations. For this, interactive plots come in handy:

if (!require(plotly)) {
  install.packages("plotly")
  library(plotly)
}

g <- ggplot(data=taxlot_sfr, aes(x=BLDGSQFT, y=TOTALVAL, text=FID)) + 
  geom_point() + geom_smooth(method="lm", se=FALSE)
ggplotly(g, tooltip=c("FID"))
# To identify the observation in the top-right corner, hover your mouse over the data point,
# and a tooltip will show its FID (the unique identifier) value: 1098. We can then get the row
# representing the observation:
taxlot_sfr %>% filter(FID==1098)
## # A tibble: 1 x 17
##     FID TOTALVAL BLDGSQFT YEARBUILT GIS_ACRES condo   apt   sfr   com dpioneer      dfwy      dpark      dmax nodedens  dbikehq slope10     dbusy
##   <int>    <int>    <int>     <int>     <dbl> <int> <int> <int> <int>    <dbl>     <dbl>      <dbl>     <dbl>    <dbl>    <dbl>   <int>     <dbl>
## 1  1098  1737570     5769      1920 0.4204119     0     0     1     0 1.459124 0.7571349 0.05611497 0.9215101      240 0.295714       1 0.1475817
# We can see it is probably a large house on a less-than-half-acre lot, built in 1920 at a very
# good location (1.5 miles from Pioneer Square; 0.06 miles from a park, etc.)

Default R diagnostic plots

(Adapted from Joseph Broach’s 2016 course material)

  1. Residuals vs Fitted: Primarily a check of the linearity of the relationship. This is a plot of residual errors (the vertical distance from the sample regression line, SRL, to each observed value) against predicted values.
  • Expected: a random cloud of points and a roughly horizontal red line indicating the conditional residual means given the fitted values

  • Departures: residuals that drift up or down considerably as Y increases (wrong explanatory variables, nonlinearity, non-fixed parameters at the population level)

  2. Normal Q-Q plot: Primarily a check of the normality of the (expected) error distribution. This is a plot of standardized residuals ordered by quantile and compared with the values we would expect at each quantile if drawn from a normal distribution.
  • Expected: points more or less fall on the 45-degree line, matching expectations. There will be some departure in real-world data, especially at the tails.

  • Departures: residuals that show little interest in following the line, or that snake noticeably from side to side (non-normality, possibly an omitted variable or a nonlinear relationship)

  3. Scale-Location: Primarily a check for heteroskedasticity (non-constant error variance). This is a plot of the square root of the absolute value of the residuals (so all are positive) against predicted values of Y.
  • Expected: a more or less horizontal fit curve with no strong evidence that the expected error magnitude changes over the range of Y.

  • Departures: residuals that trend up or down considerably as Y increases (potential heteroskedasticity or autocorrelation in the errors)

  4. Residuals vs Leverage: Primarily a check of the influence of specific data points on the sample regression. It attempts to summarize points by their influence on the regression. Cook's distance is a measure of the relative impact on the SRL of removing a specific point.
  • Expected: a red fit curve close to horizontal and no points outside the 0.5 Cook's distance bands

  • Departures: specific influential observations (bad data, or rare data that may be difficult to infer about from the sample); this is less a specific departure from assumptions and more a warning that some sample points are "exploiting" our OLS estimation technique and possibly being given too much weight as a result.
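The quantities behind these four panels can also be pulled out directly with base R extractor functions. A sketch on simulated data (the model and the names `fitted_vals`, `std_resids`, etc. are illustrative, not from the lab data):

```r
# Simulated regression; the quantities below are what plot(lm(...)) draws.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

fitted_vals <- fitted(fit)            # x-axis of Residuals vs Fitted
resids      <- residuals(fit)         # y-axis of Residuals vs Fitted
std_resids  <- rstandard(fit)         # y-axis of the Normal Q-Q plot
scale_loc   <- sqrt(abs(std_resids))  # y-axis of Scale-Location
lev         <- hatvalues(fit)         # x-axis of Residuals vs Leverage
cooks       <- cooks.distance(fit)    # contours on Residuals vs Leverage

# Cook's distance combines residual size and leverage:
# D_i = r_i^2 / p * h_i / (1 - h_i), with r_i the standardized residual
p <- length(coef(fit))
all.equal(cooks, std_resids^2 * lev / (p * (1 - lev)))
```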

lm_sfr <- lm(TOTALVAL ~ YEARBUILT + BLDGSQFT + GIS_ACRES + dpioneer + dmax, data=taxlot_sfr)

summary(lm_sfr)
## 
## Call:
## lm(formula = TOTALVAL ~ YEARBUILT + BLDGSQFT + GIS_ACRES + dpioneer + 
##     dmax, data = taxlot_sfr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -549809  -40008   -3202   36715  626089 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1035340.37  197960.55   5.230 2.07e-07 ***
## YEARBUILT      -403.89     104.40  -3.869 0.000117 ***
## BLDGSQFT        135.61       3.89  34.856  < 2e-16 ***
## GIS_ACRES    295283.37   30999.29   9.525  < 2e-16 ***
## dpioneer     -39582.98    1686.66 -23.468  < 2e-16 ***
## dmax           3145.61    3317.25   0.948 0.343228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86000 on 994 degrees of freedom
## Multiple R-squared:  0.7514, Adjusted R-squared:  0.7502 
## F-statistic:   601 on 5 and 994 DF,  p-value: < 2.2e-16
plot(lm_sfr, ask=FALSE)

Specific diagnostics with the olsrr package

Outliers and Influential Observations

if (!require(olsrr)) {
  install.packages("olsrr")
  library(olsrr)
}

# leverage (hat)
leverage <- ols_leverage(lm_sfr)
ols_rsdlev_plot(lm_sfr)

# Cook's distance
ols_cooksd_chart(lm_sfr)

# DFFITS
ols_dffits_plot(lm_sfr)

# DFBETAS
ols_dfbetas_panel(lm_sfr)
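These plots flag points against rule-of-thumb cutoffs, which are easy to compute by hand. A sketch on simulated data with one planted high-leverage point; the thresholds 2p/n, 4/n, and 2*sqrt(p/n) are common conventions, though texts vary:

```r
# Simulated data with one planted outlying, high-leverage observation.
set.seed(7)
n <- 150
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
x[n] <- 6                      # far from the rest of the x values
y[n] <- 12                     # and far from the regression line
fit <- lm(y ~ x)
p <- length(coef(fit))         # number of estimated parameters

high_lev   <- which(hatvalues(fit) > 2 * p / n)          # leverage cutoff
big_cooks  <- which(cooks.distance(fit) > 4 / n)         # Cook's D cutoff
big_dffits <- which(abs(dffits(fit)) > 2 * sqrt(p / n))  # DFFITS cutoff
high_lev; big_cooks; big_dffits   # observation 150 shows up in all three
```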

Heteroskedasticity

  • Residual vs Fitted Values Plot

This is a scatter plot of residuals (y axis) against fitted values (x axis), used to detect non-linearity, unequal error variances, and outliers.

  • Characteristics of a well-behaved residual vs fitted plot:
    • The residuals spread randomly around the 0 line, indicating that the relationship is linear.
    • The residuals form an approximately horizontal band around the 0 line, indicating homogeneity of the error variance.
    • No single residual stands out visibly from the random pattern, indicating that there are no outliers.
ols_rvsp_plot(lm_sfr)

  • Residual QQ Plot and normality test of residuals
ols_rsd_qqplot(lm_sfr)

# hypothesis test of normality of residuals
ols_norm_test(lm_sfr)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.8991         0.0000 
## Kolmogorov-Smirnov        0.0897         0.0000 
## Cramer-von Mises         84.0093         0.0038 
## Anderson-Darling         16.9564         0.0000 
## -----------------------------------------------
  • Test of Heteroskedasticity with Breusch-Pagan Test
ols_bp_test(lm_sfr)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                 Data                 
##  ------------------------------------
##  Response : TOTALVAL 
##  Variables: fitted values of TOTALVAL 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    861.8452 
##  Prob > Chi2   =    1.933814e-189
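One common form of the Breusch-Pagan statistic (the studentized, or Koenker, version) can be reproduced by hand: regress the squared residuals on the fitted values and use n times the R-squared of that auxiliary regression as a chi-square statistic with 1 degree of freedom. A sketch on simulated heteroskedastic data:

```r
# Simulated data where the error sd grows with x by construction.
set.seed(99)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

aux   <- lm(residuals(fit)^2 ~ fitted(fit))   # auxiliary regression
bp    <- n * summary(aux)$r.squared           # LM statistic, ~ chi-square(1) under H0
p_val <- pchisq(bp, df = 1, lower.tail = FALSE)
c(chi2 = bp, p = p_val)   # small p-value: reject constant variance
```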
  • Heteroskedasticity-Consistent Standard Errors
# standard variance-covariance matrix
vcov0 <- vcov(lm_sfr)
vcov0
##               (Intercept)     YEARBUILT     BLDGSQFT     GIS_ACRES      dpioneer          dmax
## (Intercept) 39188379730.4 -2.064065e+07 133496.86296 -196310620.16 137427155.953  1.309633e+08
## YEARBUILT     -20640652.1  1.090022e+04    -83.49863     103217.56    -77830.449 -7.181746e+04
## BLDGSQFT         133496.9 -8.349863e+01     15.13591     -44496.71      2074.536 -3.194746e+02
## GIS_ACRES    -196310620.2  1.032176e+05 -44496.71348  960956091.44 -12825833.371 -1.305844e+07
## dpioneer      137427156.0 -7.783045e+04   2074.53629  -12825833.37   2844822.573 -5.215829e+05
## dmax          130963340.4 -7.181746e+04   -319.47462  -13058438.95   -521582.919  1.100413e+07
# Heteroskedasticity-Consistent variance covariance matrix
require(car)
vcov_hc3 <- hccm(lm_sfr, type="hc3")
vcov_hc3
##               (Intercept)     YEARBUILT      BLDGSQFT    GIS_ACRES      dpioneer          dmax
## (Intercept)  7.048578e+10 -3.709466e+07  -30315.91239 1131011898.5 269284975.137 239474701.266
## YEARBUILT   -3.709466e+07  1.957207e+04     -25.18230    -696437.1   -146022.539   -122310.287
## BLDGSQFT    -3.031591e+04 -2.518230e+01      61.75066    -124379.3      2257.507     -8035.445
## GIS_ACRES    1.131012e+09 -6.964371e+05 -124379.27896 5143372479.3 -65293301.459 -11178992.757
## dpioneer     2.692850e+08 -1.460225e+05    2257.50736  -65293301.5   4141156.696    616312.300
## dmax         2.394747e+08 -1.223103e+05   -8035.44545  -11178992.8    616312.300   7984659.477
# In the presence of heteroskedasticity, the entries of vcov_hc3 are typically larger than
# those of vcov0; we redo the hypothesis tests using the heteroskedasticity-consistent
# variance-covariance matrix:

if (!require(lmtest)) {
  install.packages("lmtest")
  library(lmtest)
}

coeftest(lm_sfr, vcov_hc3)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept)  1.0353e+06  2.6549e+05   3.8997 0.0001028 ***
## YEARBUILT   -4.0389e+02  1.3990e+02  -2.8870 0.0039738 ** 
## BLDGSQFT     1.3561e+02  7.8582e+00  17.2570 < 2.2e-16 ***
## GIS_ACRES    2.9528e+05  7.1717e+04   4.1173 4.151e-05 ***
## dpioneer    -3.9583e+04  2.0350e+03 -19.4513 < 2.2e-16 ***
## dmax         3.1456e+03  2.8257e+03   1.1132 0.2658876    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
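What `hccm(type="hc3")` returns is a sandwich estimator, (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1, which can be written out with base matrix algebra (no car dependency). A sketch on simulated heteroskedastic data; the names `bread`, `meat`, and `hc3_vcov` are ours:

```r
# Simulated data with error sd growing in x, so OLS standard errors mislead.
set.seed(3)
n <- 300
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1 + x)
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)
bread <- solve(crossprod(X))                     # (X'X)^-1
meat  <- crossprod(X * (e^2 / (1 - h)^2), X)     # X' diag(e^2/(1-h)^2) X
hc3_vcov <- bread %*% meat %*% bread             # HC3 sandwich estimate

# Compare the usual OLS standard errors with the HC3 ones:
cbind(ols = sqrt(diag(vcov(fit))), hc3 = sqrt(diag(hc3_vcov)))
```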

Multicollinearity with VIF

Variance inflation factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the predictors. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.

ols_vif_tol(lm_sfr)
## # A tibble: 5 x 3
##   Variables Tolerance      VIF
##       <chr>     <dbl>    <dbl>
## 1 YEARBUILT 0.7155483 1.397530
## 2  BLDGSQFT 0.7939384 1.259544
## 3 GIS_ACRES 0.7985859 1.252213
## 4  dpioneer 0.6864766 1.456714
## 5      dmax 0.8759423 1.141628
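The VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors. A sketch with two correlated simulated predictors (names are illustrative):

```r
# Two simulated predictors with correlation about 0.8.
set.seed(5)
n <- 400
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)   # the model whose coefficient variances get inflated

r2_1  <- summary(lm(x1 ~ x2))$r.squared   # R^2 from regressing x1 on x2
vif_1 <- 1 / (1 - r2_1)                   # VIF for x1
vif_1
# With only two predictors, R^2_j is just their squared correlation:
all.equal(vif_1, 1 / (1 - cor(x1, x2)^2))
```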

Tasks

  1. Visualize the outcome variable and a numeric independent variable in your dataset (if your dataset does not have numeric variables, use the taxlot_sample data set)
    • Are there any potential outliers and influential observations?
    • Examine a few of the outliers and influential observations; is there a reason for them to be outliers and/or influential observations?
    • How do you plan to handle them?
  2. Run regressions of the outcome variable on a number of independent variables, and go through the default and customized diagnostic plots
    • Are there patterns in your Residuals vs Fitted plot? What may be the cause?
    • Are your residuals iid normal?
    • Does your model suffer from heteroskedasticity? If so, calculate the Heteroskedasticity-Consistent Standard Errors and re-do the t-tests on coefficients using the Heteroskedasticity-Consistent Standard Errors.
    • Does your model suffer from multicollinearity? If so, how would you address it?